unimelb: Spanish Text Normalisation

Authors

Bo Han

Paul Cook

Timothy Baldwin

Abstract

This paper describes a lexicon-based text normalisation approach for Spanish tweets. We first compare English and Spanish text normalisation, and hypothesise that an approach previously proposed for English can be adapted to Spanish. A corpus-derived normalisation lexicon is built using distributional similarity, and is combined with existing lexicons (e.g., containing Spanish Internet slang). These lexicons enable a very fast, look-up based approach to text normalisation. Experimental results indicate that the corpus-derived lexicon complements existing lexicons, but that the approach could be improved through better handling of certain word types, such as named entities.
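The look-up step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `merge_lexicons` and `normalise`, the example entries, and the rule that earlier lexicons take precedence on conflicts are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List


def merge_lexicons(lexicons: Iterable[Dict[str, str]]) -> Dict[str, str]:
    """Combine several normalisation lexicons into one look-up table.

    Assumption: when two lexicons disagree, the earlier one wins.
    """
    merged: Dict[str, str] = {}
    for lexicon in lexicons:
        for variant, canonical in lexicon.items():
            # setdefault keeps the first mapping seen for each variant.
            merged.setdefault(variant.lower(), canonical)
    return merged


def normalise(tokens: List[str], lexicon: Dict[str, str]) -> List[str]:
    """Replace each token found in the lexicon; leave all others unchanged."""
    return [lexicon.get(token.lower(), token) for token in tokens]


# Hypothetical entries, not taken from the paper's lexicons.
slang_lexicon = {"q": "que", "xq": "porque", "tb": "también"}
corpus_lexicon = {"q": "qué", "tmb": "también"}

lexicon = merge_lexicons([slang_lexicon, corpus_lexicon])
print(normalise("xq no vienes tmb".split(), lexicon))
```

Because each token costs a single dictionary look-up, the run time grows linearly with the length of the tweet. Tokens missing from the lexicon, such as named entities, pass through unchanged, which matches the limitation the abstract mentions.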

Similar resources

Han, Bo, Paul Cook and Timothy Baldwin (2013) unimelb: Spanish Text Normalisation, In Proceedings of the Tweet Normalization Workshop at SEPLN 2013 (Tweet-norm), Madrid, Spain, pp. 67-71

DLSI en Tweet-Norm 2013: Normalización de Tweets en Español

Its lexical richness and ease of access to large volumes of information make the Web 2.0 an important resource for Natural Language Processing. Nevertheless, the frequent presence of non-normative linguistic phenomena can make any automatic processing challenging. This paper describes the participation in the Text Normalisation Workshop at the SEPLN conference (Tweet-nor...

TexAFon 2.0: A text processing tool for the generation of expressive speech in TTS applications

This paper presents TexAFon 2.0, an improved version of the text processing tool TexAFon, specially oriented to the generation of synthetic speech with expressive content. TexAFon is a text processing module in Catalan and Spanish for TTS systems, which performs all the typical tasks needed for the generation of synthetic speech from text: sentence detection, pre-processing, phonetic transcript...

Improving the utility of social media with Natural Language Processing

Social media has been an attractive target for many natural language processing (NLP) tasks and applications in recent years. However, the unprecedented volume of data and the non-standard language register cause problems for off-the-shelf NLP tools. This thesis investigates the broad question of how NLP-based text processing can improve the utility (i.e., the effectiveness and efficiency) of s...

Exploring Methods and Resources for Discriminating Similar Languages

The Discriminating between Similar Languages (DSL) shared task at VarDial challenged participants to build an automatic language identification system to discriminate between 13 languages in 6 groups of highly-similar languages (or national varieties of the same language). In this paper, we describe the submissions made by team UniMelb-NLP, which took part in both the closed and open categories...

Publication date: 2013
